This project examines economic trends in Major League Baseball. Data sources include the Lahman database (for salaries, performance, and other info), Baseball Reference (for Wins Above Replacement), Forbes (for financials), the US Census Bureau (for US household income), and the Bureau of Labor Statistics (for Consumer Price Index). More details about data sources are on the project website (https://sites.google.com/view/bst260mlm). The data exploration includes trends in player salary, team finances, team performance, and the relationship between salary/spending and player/team performance.
Note that salary information is only available from 1985 forward.
min(salaries$yearID)
## [1] 1985
We can observe salary trends over time using player salary data from the Lahman dataset, and US household salary data.
Using the respective 1985 salaries as a baseline, we see that US median household income has increased by around 25% every 5 years. Growth in median player salary has generally outpaced the salary growth of US households until the past decade. Average salary, in contrast, has increased at a greater rate since 1990. This suggests that salary trends for “top” players may be different from that of “typical” players. Let’s see if this is true.
Observing the trends by percentile, we can see that, indeed, salaries have not increased at the same rate across the board, with large gains for the top paid players, with little change for players in the lower percentiles. We will now look at a comparison of percentiles by team.
On a team-by-team basis, we can see that salary growth was not the same between the median and the 90th salary percentiles. In particular, there was a decreasing trend in median salary between 1991 and 1995, where the 90th percentile did not change. What may have caused this?
Using the same data to plot the ratio of the 90th percentile salary to the median, we can see an increasing trend until 1995, the year after the baseball strike of 1994, then a decreasing trend afterwards. Again, percentiles were calculated by team.
Is this a real trend? Let’s check with a piecewise linear spline regression (2 methods).
##
## Call:
## lm(formula = p90rat ~ yearID + yearID * (yearID > 1994.5), data = salteam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4831 -2.6670 -0.7435 1.4938 21.7437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.404e+03 1.629e+02 -8.620 <2e-16 ***
## yearID 7.081e-01 8.185e-02 8.652 <2e-16 ***
## yearID > 1994.5TRUE 1.528e+03 1.697e+02 9.005 <2e-16 ***
## yearID:yearID > 1994.5TRUE -7.664e-01 8.526e-02 -8.989 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.847 on 914 degrees of freedom
## Multiple R-squared: 0.1634, Adjusted R-squared: 0.1607
## F-statistic: 59.5 on 3 and 914 DF, p-value: < 2.2e-16
##
## ***Regression Model with Segmented Relationship(s)***
##
## Call:
## segmented.lm(obj = splfit2, seg.Z = ~yearID, psi = 1994)
##
## Estimated Break-Point(s):
## Est. St.Err
## 1994.396 0.723
##
## Meaningful coefficients of the linear terms:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.404e+03 1.629e+02 -8.620 <2e-16 ***
## yearID 7.081e-01 8.185e-02 8.652 <2e-16 ***
## U1.yearID -7.664e-01 8.526e-02 -8.989 NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.847 on 914 degrees of freedom
## Multiple R-Squared: 0.1634, Adjusted R-squared: 0.1607
##
## Convergence attained in 2 iterations with relative change 0
Looks like there are different trends before and after the strike. Modeling a discontinuous regression also illustrates this break.
##
## Call:
## RDestimate(formula = p90rat ~ yearID, data = salteam, cutpoint = 1994.5)
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value
## LATE 6.049 334 2.872 1.0894 2.636
## Half-BW 3.024 166 4.335 1.6445 2.636
## Double-BW 12.097 618 1.708 0.7535 2.266
## Pr(>|z|)
## LATE 0.008394 **
## Half-BW 0.008395 **
## Double-BW 0.023436 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 34.93 3 330 0.000e+00
## Half-BW 11.26 3 162 1.901e-06
## Double-BW 74.99 3 614 0.000e+00
Following the strike of 1994, a two-pronged plan was put in place to limit the effect of reduced revenues. First was to enact revenue-sharing in order to keep smaller teams from struggling financially. The other was a luxury tax, which was a way to limit the spending of the big teams without explicitly enforcing a salary cap.
Let’s look at how team/franchise revenue has changed over time. Data comes from Michael Ozanian, a writer for Forbes, who has been compiling annual financial data for Major League Baseball since 1990. Numbers have been adjusted to 1990 dollars using the Consumer Price Index (CPI).
We see that there is a drop in revenues in the year of the stike, but this recovers after a few years. Note that the 2012 revenue data is incomplete and skewed upwards. What about franchise values?
We can see that the strike did not drastically affect franchise values. Soon after the strike, as revenues recovered, valuations rose greatly after the revenue sharing agreement, and again after 2010 (note that the 2012 valuation data is complete). Let’s look back at revenues. How did revenues for each team change from year to year?
We can see that teams mostly have increasing revenues year-on-year, with the biggest increases in 1997 and 2014. The other goal of the revenue sharing agreement was to prevent losses. Was this accomplished? We will look at this information over time, stratifying by Collective Bargaining Agreement (CBA), the contract between team owners that sets the financial rules of the league.
We see that the percentage of teams suffering either an operating loss or loss in franchiase value has decreased over time, particularly following the enactment of CBA10. Is this a real trend? Let’s check.
##
## Call:
## lm(formula = teams ~ Year, data = oloss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28988 -0.10506 0.01310 0.05647 0.29085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.710471 7.714400 3.851 0.000767 ***
## Year -0.014688 0.003852 -3.813 0.000845 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1473 on 24 degrees of freedom
## Multiple R-squared: 0.3772, Adjusted R-squared: 0.3513
## F-statistic: 14.54 on 1 and 24 DF, p-value: 0.0008448
##
## Call:
## lm(formula = teams ~ Year, data = deval)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.279949 -0.084264 -0.006249 0.103421 0.230952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.617231 8.244450 3.956 0.000627 ***
## Year -0.016183 0.004116 -3.932 0.000666 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1484 on 23 degrees of freedom
## Multiple R-squared: 0.402, Adjusted R-squared: 0.376
## F-statistic: 15.46 on 1 and 23 DF, p-value: 0.0006664
The linear models suggest a decreasing trend, with the percentage of teams suffering operating/valuation loss decreasing by ~1.5% every year, for a ~30 percentage point decrease over 20 years (not using compounding). Let’s look at differences between the different CBA periods.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cba) 4 0.3586 0.08966 6.016 0.0024 **
## Residuals 20 0.2981 0.01490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = teams ~ as.factor(cba), data = oloss)
##
## $`as.factor(cba)`
## diff lwr upr p adj
## 8-7 0.04963370 -0.1740707 0.27333810 0.9618847
## 9-7 -0.09322344 -0.3515350 0.16508815 0.8145533
## 10-7 -0.27655678 -0.5348684 -0.01824518 0.0322181
## 11-7 -0.17322344 -0.4182793 0.07183245 0.2523730
## 9-8 -0.14285714 -0.3665615 0.08084726 0.3437435
## 10-8 -0.32619048 -0.5498949 -0.10248607 0.0024846
## 11-8 -0.22285714 -0.4311146 -0.01459968 0.0323258
## 10-9 -0.18333333 -0.4416449 0.07497826 0.2489516
## 11-9 -0.08000000 -0.3250559 0.16505589 0.8624128
## 11-10 0.10333333 -0.1417226 0.34838923 0.7164209
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cba) 4 0.4607 0.11519 6.28 0.00214 **
## Residuals 19 0.3485 0.01834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = teams ~ as.factor(cba), data = deval)
##
## $`as.factor(cba)`
## diff lwr upr p adj
## 8-7 -0.31607143 -0.5917918 -0.04035104 0.0200669
## 9-7 -0.40714286 -0.7181974 -0.09608831 0.0069843
## 10-7 -0.34047619 -0.6515307 -0.02942164 0.0278618
## 11-7 -0.47714286 -0.7745679 -0.17971782 0.0009916
## 9-8 -0.09107143 -0.3404699 0.15832705 0.8053502
## 10-8 -0.02440476 -0.2738032 0.22499371 0.9982114
## 11-8 -0.16107143 -0.3932488 0.07110592 0.2660313
## 10-9 0.06666667 -0.2213139 0.35464722 0.9548525
## 11-9 -0.07000000 -0.3432023 0.20320234 0.9360068
## 11-10 -0.13666667 -0.4098690 0.13653568 0.5721522
The Tukey multiple comparisons show that valuation losses were highest during CBA7 and significantly lower for all periods thereafter. The period differences in operating losses are less clear, although CBA10 is significantly lower than CBA7 and CBA8.
sources:
https://www.sbnation.com/2010/8/30/1065675/8-30-2002-baseball-avoids-another https://www.fangraphs.com/tht/a-history-of-the-collective-bargaining-agreement-part-3/
The analysis suggests that the measures taken in successive CBAs since the 1994 strike have improved the financial sitation of the league. Did this affect spending on players?
We can see that the standardized distribution of payrolls across the league remained fairly consistent until after 2001, after which there was one team (the Yankees, obviously) with a much higher spend for several years.
Looking at percentage of revenue spent, it appears that player payrolls have been around 50% of franchise revenue for the past 20 years, with the exception of the strike year and year immediately following, where revenues took a hit.
As a measure of player performance, we will use Wins Above Replacement (WAR). The calculation is very involved and not standardized, so the WAR numbers were taken from the Baseball Refernce database (not the same as the Lahman Database).
First, let’s look at how WAR distribution across the league has changed over time.
We can see that WAR variance decreases over time, and stays low after 1996. This could be due to the revenue-sharing agreements, or perhaps due to teams being better about picking players. Regardless, it would seem that the post-strike era is more competitive than before the strike. Let’s explore this further.
First, let’s look at median WAR by team, sorted by summed WAR prior to the year 2000.
From the heat map, it seems that the teams that consistently had better players before the strike (marked by the dotted white line) lost that advantage after the strike. Let’s look at the better players (75th percentile) in each team.
The change is less evident here. Let’s see if this affected team performance.
There is no disernable pattern here. No team is overly dominant in the league, either before or after the strike. Case in point: Houston, which above is visibly the worst team between 2010-2015, just won the World Series.
What kind of players were paid more? Let’s look at the year 1995.
The above visualization sounded like a better idea than it turned out to be. Let’s look at a table instead.
## Name Team Salary
## 1 Cecil Fielder Tigers $9.2
## 2 Barry Bonds Giants $8.2
## 3 David Cone Blue Jays $8
## 4 Ken Griffey Mariners $7.6
## 5 Frank Thomas White Sox $7.2
## 6 Jeff Bagwell Astros $6.9
## 7 Mark McGwire Athletics $6.9
## 8 Cal Ripken Orioles $6.7
## 9 Greg Maddux Braves $6.5
## 10 Kirby Puckett Twins $6.3
## 11 Lenny Dykstra Phillies $6.2
## 12 Barry Larkin Reds $5.9
## 13 Jose Canseco Red Sox $5.8
## 14 Bret Saberhagen Mets $5.6
## 15 Gary Sheffield Marlins $5.6
## 16 Will Clark Rangers $5.6
## 17 Jack McDowell Yankees $5.4
## 18 Darryl Strawberry Dodgers $5.3
## 19 Mark Langston Angels $5
## 20 Larry Walker Rockies $5
## 21 Greg Vaughn Brewers $4.9
## 22 Wally Joyner Royals $4.8
## 23 Tony Gwynn Padres $4.7
## 24 Dennis Martinez Indians $4.6
## 25 Ken Hill Cardinals $4.5
## 26 Jay Bell Pirates $4.4
## 27 Mark Grace Cubs $4.4
## 28 Moises Alou Expos $3
The max salaries for each team went to future Hall of Famers Ken Griffey Jr., Barry Larkin, Kirby Puckett, and Cal Ripken, Jr., as well as other popular players of the time, including Barry Bonds, Jose Canseco, Mark McGwire, and Daryl Strawberry. So, it seems that salaries went to popular players, who were presumably also the players who played well.
Here’s a word cloud, with sizes/color corresponding to salary:
Moving on.
[Note that I left in a lot of data exploration in this section, so it will be a bit messier than above.]
Do better players make more? Let’s first look at how WAR may have influenced changes in salary.
#Previously calculated salary changes and performance changes year-to-year and saved in warplus.csv
warstats <- wars %>% filter(pitcher=="N")
#warstats <- wars
warstats <- warstats[c("playerID","yearID","teamID",
"WAA","WAA_off","WAA_def",
"WAR","WAR_off","WAR_def",
"prevWAA","prevWAA_off","prevWAA_def",
"prevWAR","prevWAR_off","prevWAR_def",
"dWAA","dWAA_off","dWAA_def",
"dWAR","dWAR_off","dWAR_def",
"OPS_plus","salary","bump","cut","dsal","pdsal")]
warstats <- warstats %>% filter(yearID > 1985)
warstats <- warstats %>% filter(dsal!=0)
warstats <- warstats %>% filter(abs(dsal)>1)
warcut <- warstats %>% filter(dsal<0)
warcut %>% ggplot() + geom_histogram(aes(as.numeric(prevWAR)),binwidth=0.05) +
labs(title="Players who received a salary decrease",x="Wins Above Replacement")
warbump <- warstats %>% filter(dsal>0)
warbump %>% ggplot() + geom_histogram(aes(as.numeric(prevWAR)),binwidth=0.05) +
labs(title="Players who received a salary increase",x="Wins Above Replacement")
All of the performance stats (including Wins Above Average (WAA), offensive and defensive stats, and previous and current performance/salary) listed in the above chunk of code were tested. No obvious trend was there. Let’s look at other trends.
warstats$pdsal[which(warstats$pdsal==Inf)] = NA
warstats %>% filter(pdsal < 40, dsal > -100000) %>% ggplot() +
geom_point(aes(prevWAA_off,dsal + abs(min(dsal))+1 )) +
geom_smooth(aes(prevWAA_off,dsal + abs(min(dsal))+1 )) +
scale_y_continuous(trans="log10") +
labs(x="WAA in previous year", y="Salary change ($, log millions)")
warstats %>% ggplot() +
geom_point(aes(prevWAA_off,dsal/1000000)) +
geom_smooth(aes(prevWAA_off,dsal/1000000)) +
scale_y_continuous(breaks=seq(-30,30,5)) +
labs(x="WAA in previous year", y="Salary change ($, millions)")
warstats %>% ggplot() +
geom_point(aes(prevWAR_off,dsal/1000000)) +
geom_smooth(aes(prevWAR_off,dsal/1000000)) +
scale_y_continuous(breaks=seq(-30,30,5)) +
labs(x="WAR in previous year", y="Salary change ($, millions)")
warstats %>% ggplot() +
geom_point(aes(prevWAR,salary/1000000)) +
geom_smooth(aes(prevWAR,salary/1000000)) +
scale_y_continuous(breaks=seq(-30,30,5)) +
labs(x="WAR in previous year", y="Current salary ($, millions)")
There may be an increasing trend. Let’s look at a regression.
summary(lm(warstats$salary~warstats$prevWAR))
##
## Call:
## lm(formula = warstats$salary ~ warstats$prevWAR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8716085 -1365840 -762866 220743 27749911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1133504 34353 33.0 <2e-16 ***
## warstats$prevWAR 752798 14847 50.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3103000 on 12132 degrees of freedom
## Multiple R-squared: 0.1749, Adjusted R-squared: 0.1748
## F-statistic: 2571 on 1 and 12132 DF, p-value: < 2.2e-16
There is a significant effect, but the R-squared is not convincing. What about for a single year?
warstats %>% filter(yearID==1991) %>% ggplot() +
geom_point(aes(prevWAR,salary/1000000)) +
geom_smooth(aes(prevWAR,salary/1000000)) +
scale_y_continuous(breaks=seq(-30,30,5),trans="log10") +
labs(x="WAR in previous year", y="Salary change ($, millions)")
w2 <- warstats %>% filter(yearID==1991)
summary(lm(w2$dsal~w2$prevWAR))
##
## Call:
## lm(formula = w2$dsal ~ w2$prevWAR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1318664 -198423 -80652 88553 2004829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161395 26878 6.005 4.81e-09 ***
## w2$prevWAR 135519 11234 12.063 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407100 on 349 degrees of freedom
## Multiple R-squared: 0.2943, Adjusted R-squared: 0.2922
## F-statistic: 145.5 on 1 and 349 DF, p-value: < 2.2e-16
Sure enough, this looks a bit more convincing. Let’s look at this over several years.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cba) 4 0.22894 0.05724 36.4 4.39e-10 ***
## Residuals 25 0.03932 0.00157
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = adj.r.squared ~ as.factor(cba), data = regres)
##
## $`as.factor(cba)`
## diff lwr upr p adj
## 8-7 0.06186162 0.003628854 0.120094394 0.0333907
## 9-7 -0.01380545 -0.085125738 0.057514835 0.9784690
## 10-7 -0.13789710 -0.209217387 -0.066576814 0.0000596
## 11-7 -0.16410402 -0.227002599 -0.101205438 0.0000005
## 9-8 -0.07566708 -0.146987362 -0.004346789 0.0336931
## 10-8 -0.19975872 -0.271079011 -0.128438438 0.0000001
## 11-8 -0.22596564 -0.288864223 -0.163067062 0.0000000
## 10-9 -0.12409165 -0.206445222 -0.041738075 0.0014244
## 11-9 -0.15029857 -0.225476750 -0.075120384 0.0000367
## 11-10 -0.02620692 -0.101385102 0.048971265 0.8420085
We see that the correlation between current salary and WAR in the previous year changes over time, with greater correlation during CBA6-CBA9, and less thereafter.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cba) 4 0.14687 0.03672 33.12 1.18e-09 ***
## Residuals 25 0.02772 0.00111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = adj.r.squared ~ as.factor(cba), data = regres)
##
## $`as.factor(cba)`
## diff lwr upr p adj
## 8-7 0.1071798491 0.05828624 0.15607346 0.0000090
## 9-7 0.0114611309 -0.04842107 0.07134333 0.9793432
## 10-7 -0.0748273792 -0.13470958 -0.01494518 0.0092462
## 11-7 -0.0740462640 -0.12685740 -0.02123513 0.0030796
## 9-8 -0.0957187182 -0.15560092 -0.03583652 0.0007211
## 10-8 -0.1820072283 -0.24188943 -0.12212503 0.0000000
## 11-8 -0.1812261131 -0.23403725 -0.12841497 0.0000000
## 10-9 -0.0862885101 -0.15543452 -0.01714250 0.0093555
## 11-9 -0.0855073949 -0.14862878 -0.02238601 0.0043512
## 11-10 0.0007811152 -0.06234027 0.06390250 0.9999996
Looking at salary and performance within the same year, we see that the highest correlation was during CBA8, though there were significant differences between periods.
Why would this correlation change with the collective bargaining agreements? Perhaps it had to do with luxury tax. Or perhaps teams simply got better at paying players less. Perhaps WAR doesn’t tell the whole story.
Now let’s look at how salary may have affected team performances, in successive five-year periods. First, win percentage against salary, with salary standardized by year.
#yearly standard
require(gridExtra)
date1 = seq(1985,2010,5)
date2 = seq(1989,2014,5)
dates = data.frame(date1,date2)
plots = list()
for(i in 1:6){
tmpsal <- salteam3 %>% filter(yearID > dates$date1[[i]] & yearID <= dates$date2[[i]])
tmpsal <- tmpsal %>% mutate(stdsal = (totsal - mutoty)/sdtoty)
wlres <- tmpsal %>% lm(wpct ~ stdsal,data=.)
print(summary(wlres))
label = paste(dates$date1[[i]],dates$date2[[i]],sep="-")
myplot <- tmpsal %>% ggplot() + geom_point(aes(stdsal,wpct)) + geom_smooth(aes(stdsal,wpct)) + labs(title=label,x="Standardized Salary",y="Win Percentage")
newvar = paste("myplot",i,sep="")
assign(newvar,myplot)
#print(myplot)
}
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.170968 -0.041127 0.003113 0.043322 0.154448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.499950 0.006221 80.363 <2e-16 ***
## stdsal 0.011004 0.006344 1.734 0.0859 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06344 on 102 degrees of freedom
## Multiple R-squared: 0.02865, Adjusted R-squared: 0.01913
## F-statistic: 3.008 on 1 and 102 DF, p-value: 0.08585
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14919 -0.05129 -0.01026 0.05134 0.17869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.500060 0.006151 81.299 < 2e-16 ***
## stdsal 0.017996 0.006268 2.871 0.00494 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06392 on 106 degrees of freedom
## Multiple R-squared: 0.07215, Adjusted R-squared: 0.0634
## F-statistic: 8.243 on 1 and 106 DF, p-value: 0.00494
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.16329 -0.04518 0.00182 0.03722 0.14106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.499952 0.005431 92.05 < 2e-16 ***
## stdsal 0.039849 0.005527 7.21 6.57e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0585 on 114 degrees of freedom
## Multiple R-squared: 0.3132, Adjusted R-squared: 0.3071
## F-statistic: 51.98 on 1 and 114 DF, p-value: 6.57e-11
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20606 -0.05451 0.01146 0.05256 0.20215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.500000 0.006899 72.477 < 2e-16 ***
## stdsal 0.036670 0.007017 5.226 7.57e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07557 on 118 degrees of freedom
## Multiple R-squared: 0.188, Adjusted R-squared: 0.1811
## F-statistic: 27.31 on 1 and 118 DF, p-value: 7.566e-07
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.145329 -0.041975 0.002131 0.041144 0.134287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.499977 0.005218 95.812 < 2e-16 ***
## stdsal 0.029380 0.005308 5.536 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05716 on 118 degrees of freedom
## Multiple R-squared: 0.2061, Adjusted R-squared: 0.1994
## F-statistic: 30.64 on 1 and 118 DF, p-value: 1.904e-07
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.149096 -0.038805 0.007312 0.052624 0.114660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.499994 0.006038 82.801 < 2e-16 ***
## stdsal 0.021162 0.006142 3.446 0.00079 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06615 on 118 degrees of freedom
## Multiple R-squared: 0.09141, Adjusted R-squared: 0.08371
## F-statistic: 11.87 on 1 and 118 DF, p-value: 0.0007899
grid.arrange(myplot1,myplot2,myplot3,myplot4,myplot5,myplot6,ncol=3)
Next, win percentage against salary, with salary standardized over each 5-year period.
#5 year standard
date1 = seq(1985,2010,5)
date2 = seq(1989,2014,5)
dates = data.frame(date1,date2)
for(i in 1:6){
tmpsal <- salteam2 %>% filter(yearID > dates$date1[[i]] & yearID <= dates$date2[[i]])
meansdsal <- tmpsal %>% summarize(meansal = mean(totsal), sdsal = sd(totsal))
tmpsal <- tmpsal %>% mutate(stdsal = (totsal - meansdsal$meansal)/meansdsal$sdsal)
wlres <- tmpsal %>% lm(wpct ~ stdsal,data=.)
print(summary(wlres))
label = paste(dates$date1[[i]],dates$date2[[i]],sep="-")
myplot <- tmpsal %>% ggplot() + geom_point(aes(stdsal,wpct)) + geom_smooth(aes(stdsal,wpct)) + labs(title=label,x="Standardized Salary",y="Win Percentage")
newvar = paste("myplot",i,sep="")
assign(newvar,myplot)
#print(myplot)
}
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.172902 -0.040316 0.003635 0.042368 0.154907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.499829 0.006214 80.440 <2e-16 ***
## stdsal 0.010701 0.005914 1.809 0.0734 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06336 on 102 degrees of freedom
## Multiple R-squared: 0.03109, Adjusted R-squared: 0.0216
## F-statistic: 3.273 on 1 and 102 DF, p-value: 0.07335
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.148902 -0.047993 0.000505 0.046672 0.163294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.498574 0.005951 83.773 < 2e-16 ***
## stdsal 0.019611 0.004829 4.061 9.39e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06173 on 106 degrees of freedom
## Multiple R-squared: 0.1346, Adjusted R-squared: 0.1265
## F-statistic: 16.49 on 1 and 106 DF, p-value: 9.39e-05
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.15143 -0.04235 0.00053 0.04179 0.16215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.496562 0.005639 88.054 < 2e-16 ***
## stdsal 0.028600 0.004450 6.427 3.13e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06047 on 114 degrees of freedom
## Multiple R-squared: 0.266, Adjusted R-squared: 0.2595
## F-statistic: 41.31 on 1 and 114 DF, p-value: 3.129e-09
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20521 -0.05467 0.01017 0.05316 0.20306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.499105 0.006890 72.44 < 2e-16 ***
## stdsal 0.036616 0.006948 5.27 6.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07545 on 118 degrees of freedom
## Multiple R-squared: 0.1905, Adjusted R-squared: 0.1837
## F-statistic: 27.77 on 1 and 118 DF, p-value: 6.242e-07
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.146966 -0.040119 0.005833 0.039018 0.133624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.499374 0.005238 95.334 < 2e-16 ***
## stdsal 0.028332 0.005209 5.439 2.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05737 on 118 degrees of freedom
## Multiple R-squared: 0.2005, Adjusted R-squared: 0.1937
## F-statistic: 29.59 on 1 and 118 DF, p-value: 2.938e-07
##
##
## Call:
## lm(formula = wpct ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.153625 -0.041106 0.005691 0.053141 0.114145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.499030 0.006087 81.988 < 2e-16 ***
## stdsal 0.017940 0.005633 3.185 0.00185 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06659 on 118 degrees of freedom
## Multiple R-squared: 0.07916, Adjusted R-squared: 0.07136
## F-statistic: 10.14 on 1 and 118 DF, p-value: 0.001852
grid.arrange(myplot1,myplot2,myplot3,myplot4,myplot5,myplot6,ncol=3)
Next, league rank against salary, with salary standardized over each 5-year period.
date1 = seq(1985,2010,5)
date2 = seq(1989,2014,5)
dates = data.frame(date1,date2)
for(i in 1:6){
tmpsal <- salteam2 %>% filter(yearID > dates$date1[[i]] & yearID <= dates$date2[[i]])
meansdsal <- tmpsal %>% summarize(meansal = mean(totsal), sdsal = sd(totsal))
tmpsal <- tmpsal %>% mutate(stdsal = (totsal - meansdsal$meansal)/meansdsal$sdsal)
wlres <- tmpsal %>% lm(Rank ~ stdsal,data=.)
print(summary(wlres))
label = paste(dates$date1[[i]],dates$date2[[i]],sep="-")
myplot <- tmpsal %>% ggplot() + geom_point(aes(stdsal,Rank)) + geom_smooth(aes(stdsal,Rank)) + labs(title=label,x="Standardized Salary",y="Rank")
newvar = paste("myplot",i,sep="")
assign(newvar,myplot)
#print(myplot)
}
##
## Call:
## lm(formula = Rank ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2970 -1.6387 -0.0339 1.6072 3.6072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7345 0.1822 20.500 <2e-16 ***
## stdsal -0.3294 0.1734 -1.899 0.0603 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.858 on 102 degrees of freedom
## Multiple R-squared: 0.03416, Adjusted R-squared: 0.02469
## F-statistic: 3.608 on 1 and 102 DF, p-value: 0.06033
##
##
## Call:
## lm(formula = Rank ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1544 -1.4275 -0.1564 1.2422 4.0194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5734 0.1689 21.161 < 2e-16 ***
## stdsal -0.4798 0.1370 -3.502 0.000678 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.752 on 106 degrees of freedom
## Multiple R-squared: 0.1037, Adjusted R-squared: 0.09523
## F-statistic: 12.26 on 1 and 106 DF, p-value: 0.0006778
##
##
## Call:
## lm(formula = Rank ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4576 -0.9606 -0.0107 0.9254 3.6188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.00601 0.12036 24.976 < 2e-16 ***
## stdsal -0.48703 0.09497 -5.128 1.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.291 on 114 degrees of freedom
## Multiple R-squared: 0.1874, Adjusted R-squared: 0.1803
## F-statistic: 26.3 on 1 and 114 DF, p-value: 1.21e-06
##
##
## Call:
## lm(formula = Rank ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7160 -1.1112 0.1837 0.9506 2.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0492 0.1218 25.027 < 2e-16 ***
## stdsal -0.6499 0.1229 -5.289 5.73e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.334 on 118 degrees of freedom
## Multiple R-squared: 0.1917, Adjusted R-squared: 0.1848
## F-statistic: 27.98 on 1 and 118 DF, p-value: 5.729e-07
##
##
## Call:
## lm(formula = Rank ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68509 -1.13595 0.01056 1.02661 3.16204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0282 0.1241 24.394 < 2e-16 ***
## stdsal -0.5436 0.1234 -4.404 2.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.36 on 118 degrees of freedom
## Multiple R-squared: 0.1412, Adjusted R-squared: 0.1339
## F-statistic: 19.39 on 1 and 118 DF, p-value: 2.351e-05
##
##
## Call:
## lm(formula = Rank ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4240 -1.1755 -0.1448 0.9357 2.6941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0421 0.1295 23.496 < 2e-16 ***
## stdsal -0.3180 0.1198 -2.654 0.00905 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.417 on 118 degrees of freedom
## Multiple R-squared: 0.05633, Adjusted R-squared: 0.04834
## F-statistic: 7.044 on 1 and 118 DF, p-value: 0.009048
grid.arrange(myplot1,myplot2,myplot3,myplot4,myplot5,myplot6,ncol=3)
Next, win-loss ratio against salary, with salary standardized over each 5-year period.
#yearly standard
date1 = seq(1985,2010,5)
date2 = seq(1989,2014,5)
dates = data.frame(date1,date2)
for(i in 1:6){
tmpsal <- salteam3 %>% filter(yearID > dates$date1[[i]] & yearID <= dates$date2[[i]])
tmpsal <- tmpsal %>% mutate(stdsal = (totsal - mutoty)/sdtoty)
wlres <- tmpsal %>% lm(wlratio ~ stdsal,data=.)
print(summary(wlres))
label = paste(dates$date1[[i]],dates$date2[[i]],sep="-")
myplot <- tmpsal %>% ggplot() + geom_point(aes(stdsal,wlratio)) + geom_smooth(aes(stdsal,wlratio)) + labs(title=label,x="Standardized Salary",y="Win-Loss Ratio")
newvar = paste("myplot",i,sep="")
assign(newvar,myplot)
#print(myplot)
}
##
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55857 -0.17972 -0.01983 0.15692 0.90961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03342 0.02611 39.586 <2e-16 ***
## stdsal 0.05109 0.02662 1.919 0.0578 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2662 on 102 degrees of freedom
## Multiple R-squared: 0.03485, Adjusted R-squared: 0.02539
## F-statistic: 3.683 on 1 and 102 DF, p-value: 0.05777
##
##
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51801 -0.21742 -0.07618 0.19493 0.93268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03698 0.02658 39.011 < 2e-16 ***
## stdsal 0.07270 0.02709 2.684 0.00845 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2762 on 106 degrees of freedom
## Multiple R-squared: 0.06362, Adjusted R-squared: 0.05479
## F-statistic: 7.202 on 1 and 106 DF, p-value: 0.00845
##
##
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52790 -0.17900 -0.02483 0.14699 1.05555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.04262 0.02438 42.758 < 2e-16 ***
## stdsal 0.17596 0.02482 7.091 1.19e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2626 on 114 degrees of freedom
## Multiple R-squared: 0.3061, Adjusted R-squared: 0.3
## F-statistic: 50.28 on 1 and 114 DF, p-value: 1.193e-10
##
##
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62279 -0.21269 -0.00028 0.18769 1.40373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.05802 0.02965 35.684 < 2e-16 ***
## stdsal 0.15825 0.03016 5.248 6.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3248 on 118 degrees of freedom
## Multiple R-squared: 0.1892, Adjusted R-squared: 0.1823
## F-statistic: 27.54 on 1 and 118 DF, p-value: 6.883e-07
##
##
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52235 -0.15494 -0.01333 0.16523 0.61128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03274 0.02112 48.897 < 2e-16 ***
## stdsal 0.12556 0.02148 5.845 4.6e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2314 on 118 degrees of freedom
## Multiple R-squared: 0.2245, Adjusted R-squared: 0.2179
## F-statistic: 34.16 on 1 and 118 DF, p-value: 4.596e-08
##
##
## Call:
## lm(formula = wlratio ~ stdsal, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4709 -0.1866 -0.0040 0.2036 0.5325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03796 0.02447 42.412 < 2e-16 ***
## stdsal 0.08536 0.02489 3.429 0.000834 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2681 on 118 degrees of freedom
## Multiple R-squared: 0.09063, Adjusted R-squared: 0.08292
## F-statistic: 11.76 on 1 and 118 DF, p-value: 0.0008343
grid.arrange(myplot1,myplot2,myplot3,myplot4,myplot5,myplot6,ncol=3)
The above analyses seem to indicate better win rates and ranks (with 1 being best) as more money is spent. The correlations vary by period, however, so let’s do another year-by-year breakdown.
Average salary seems to have a greater effect than total salary, though both have similar trends. The greatest correlation is during CBA8. Let’s see about the effect.
Even at its greatest, the effect is relatively small, with about a 4 percent increase in win percentage for a 1 standard deviation increase in average salary in a given year. How about league rank?
The effect size is similarly underwhelming, with a maximum of a ~0.9 increase in league rank for every standard deviation increase in average salary.
The correlation between average salary and spending is generally weak, with most adjusted R-squared values below 0.1.
Overall, the effect of spending more money seems to be small, but perhaps those small margins, combined with other management factors, are enough to translate into a competitive advantage, however transient.
In summary, we have seen that:
(1) Player salaries have increased over time, but the top rates have done so faster than the lower ones.
(2) Financial measures taken following the 1994 strike and in successive CBAs seem to have stabilized revenue streams for individual franchises and the overall growth of the league.
(3) Teams that were previously dominant on paper are less so today, but Major League Baseball remains as competitive as ever.
(4) Paying more money has a small effect on player and team performance, which has decreased over time.